[not for land] DDP regression tests #2613

Open

HDCharles wants to merge 1 commit into main from awq-ddp-regression-tests

Conversation

@HDCharles
Collaborator

Summary

  • Comprehensive test framework comparing pre-DDP (2ab02443) vs post-DDP (0bc916e5) AWQ quality across 5 models, 3 quantization formats, and 3 evaluation benchmarks
  • Example-style quantization scripts for each model with custom AWQ mappings where needed (Mixtral MoE w1/w2/w3, Llama-4-Scout multi-submodel scoping)
  • Bash harness (run_all_tests.sh) with resumable execution, progressive results table, and automatic model cleanup

Test Matrix

  • Models: Llama-3-8B, Qwen3-VL-8B, Llama-4-Scout-17B, Qwen2.5-32B, Mixtral-8x7B
  • Schemes: W4A16_ASYM, W8A8, W8A16
  • Benchmarks: gsm8k (5-shot), wikitext (0-shot), mmlu (5-shot)
  • Code states: pre-DDP, post-DDP

Usage

./run_all_tests.sh 2>&1 | tee regression_results.log
python extract_log_summary.py regression_results.log

Test plan

  • Smoke tested quantization at both commits with multiple schemes
  • Verified vLLM eval with chat template and fewshot_as_multiturn
  • Verified AWQ mapping patterns match correct layers for all models
  • Confirmed eval skip and model cleanup logic works

🤖 Generated with Claude Code

Comprehensive test framework comparing pre-DDP vs post-DDP AWQ quality
across 5 models, 3 quantization formats (W4A16_ASYM, W8A8, W8A16), and
3 benchmarks (gsm8k, wikitext, mmlu) via vLLM lm_eval.

Models: Llama-3-8B, Qwen3-VL-8B, Llama-4-Scout-17B, Qwen2.5-32B, Mixtral-8x7B
Features:
- Skips quantization/eval if results already exist (resumable)
- Cleans up quantized models after eval to save disk
- vLLM eval fallback chain (TP -> expert_parallel -> TP=1 -> eager -> hf)
- Applies chat template and fewshot_as_multiturn for eval
- Progressive results summary table with extracted metrics
- Custom AWQ mappings for Mixtral (w1/w2/w3) and Llama-4-Scout (scoped to language_model)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@HDCharles changed the title from [AWQ] Add DDP regression test suite to [not for land] DDP regression tests on Apr 13, 2026
Contributor

@gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive regression test suite for AWQ quantization, specifically designed to compare pre-DDP and post-DDP code states. It includes individual test scripts for Llama 3, Llama 4 Scout, Mixtral, Qwen 2.5, and Qwen 3 VL, alongside a log extraction utility and an automation script. Feedback focuses on improving the portability and efficiency of the automation script, resolving a documentation contradiction regarding model cleanup, and ensuring the suite actually exercises DDP code paths by correctly configuring multi-GPU quantization.

Comment thread run_all_tests.sh
# ── Helper: activate environment ─────────────────────────────────────────────

activate_env() {
source /home/HDCharles/rhdev/bin/activate

critical

The virtual environment activation path is hardcoded to a specific user's home directory (/home/HDCharles/...). This makes the script non-portable and will cause it to fail for any other user or in a CI environment.
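A minimal sketch of one way to address this, assuming the script grows a VENV_ACTIVATE environment variable (the variable name and error handling are illustrative, not part of the current harness):

activate_env() {
    # Read the activate script path from an env var instead of a personal home dir.
    if [ -z "${VENV_ACTIVATE:-}" ] || [ ! -f "$VENV_ACTIVATE" ]; then
        echo "ERROR: set VENV_ACTIVATE to your virtualenv's activate script" >&2
        exit 1
    fi
    # shellcheck source=/dev/null
    source "$VENV_ACTIVATE"
}

Callers would then run, for example, VENV_ACTIVATE=$HOME/rhdev/bin/activate ./run_all_tests.sh.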

Comment thread run_all_tests.sh
Comment on lines +35 to +41
MODEL_VLLM_ARGS=(
"2048,1,1"
"4096,1,1"
"4096,2,1"
"4096,2,1"
"2048,2,1"
)

high

The num_gpus_quant parameter (the third value in each entry of MODEL_VLLM_ARGS) is set to 1 for all models. This means the script never executes the torchrun path (line 279) and only performs single-GPU quantization. This prevents the "DDP regression test suite" from actually exercising Distributed Data Parallel (DDP) code paths in llmcompressor, which is the primary objective of this PR.
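A sketch of what exercising the DDP path could look like; the field parsing, variable names, and torchrun invocation below are assumptions for illustration, not the harness's actual code:

# Give at least one model num_gpus_quant > 1 (third field) so quantization
# goes through torchrun instead of a single-GPU python launch.
MODEL_VLLM_ARGS=(
    "2048,1,2"   # assumed order: max_model_len, tp_size, num_gpus_quant
    "4096,1,1"
)

IFS=',' read -r max_len tp_size num_gpus_quant <<< "${MODEL_VLLM_ARGS[$model_idx]}"
if [ "$num_gpus_quant" -gt 1 ]; then
    torchrun --nproc_per_node="$num_gpus_quant" "$quant_script" --scheme "$scheme" --save-dir "$save_dir"
else
    python "$quant_script" --scheme "$scheme" --save-dir "$save_dir"
fi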

Comment thread run_all_tests.sh
# ./run_all_tests.sh 2>&1 | tee regression_results.log
# python extract_log_summary.py regression_results.log
#
# Models are saved to disk and NOT cleaned up for follow-up evaluation.

medium

There is a contradiction between this header comment, which states that models are "NOT cleaned up", and the implementation at line 335, which explicitly removes the quantized model directory using rm -rf "$save_dir".

Comment thread run_all_tests.sh
Comment on lines +254 to +255
for scheme in "${SCHEMES[@]}"; do
for code_state in "${CODE_STATES[@]}"; do

medium

The current loop structure is inefficient. It iterates through models and schemes, and for each combination, it switches the git branch and re-installs the package (pip install -e .). This results in 30 separate installations (5 models * 3 schemes * 2 states). Reordering the loops to checkout each code state once and run all applicable models/schemes for that state would significantly reduce execution time.
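A sketch of the suggested reordering, reusing the loop variables and switch_code_state from the snippets above; the MODELS array and run_one_test helper are illustrative placeholders:

# Install each code state once, then sweep every model and scheme under it:
# 2 checkout/install cycles total instead of 30.
for code_state in "${CODE_STATES[@]}"; do
    switch_code_state "$code_state"
    for model_idx in "${!MODELS[@]}"; do
        for scheme in "${SCHEMES[@]}"; do
            run_one_test "$model_idx" "$scheme" "$code_state"
        done
    done
done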

Contributor

coderabbitai Bot commented Apr 13, 2026

Summary by CodeRabbit

New Features

  • Added AWQ quantization example scripts for Llama 3, Llama 4 Scout, Mixtral, Qwen 2.5 32B, and Qwen 3 VL models with configurable quantization schemes and calibration options.

Tests

  • Added regression test infrastructure including a log analyzer utility and comprehensive test runner for multi-model evaluation.

Walkthrough

Introduces a comprehensive AWQ quantization regression test suite with model-specific quantization scripts for Llama 3, Llama 4 Scout, Mixtral, Qwen 2.5-32B, and Qwen 3 Vision-Language models, along with automated testing orchestration and result summarization utilities.

Changes

Cohort / File(s) Summary
AWQ Quantization Regression Scripts
examples/awq/regression_tests/llama3_awq.py, examples/awq/regression_tests/llama4_scout_awq.py, examples/awq/regression_tests/mixtral_awq.py, examples/awq/regression_tests/qwen25_32b_awq.py, examples/awq/regression_tests/qwen3_vl_awq.py
Model-specific quantization test scripts implementing CLI argument parsing (--scheme, --save-dir, --num-samples), dataset loading/preprocessing with model-specific chat templates, AWQModifier construction with targeted ignore patterns and custom mappings, quantization execution via oneshot(), and model generation testing followed by compressed model saving. A sample invocation is shown after this table.
Testing Automation & Orchestration
run_all_tests.sh
Bash orchestration script performing multi-model DDP regression testing across multiple schemes and code states, managing quantization execution via git checkout, coordinating evaluation tasks with vLLM, tracking results in CSV, implementing fallback evaluation strategies, and producing summary statistics.
Log Processing Utility
extract_log_summary.py
Python script parsing structured regression log sections to extract timing, GPU memory, and quantization metrics (GSM8K Flex/Strict, Wiki Perplexity, MMLU accuracy), generating formatted summary tables with pre/post-DDP comparisons and delta calculations.
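For reference, a typical standalone invocation of one of the regression scripts, based on the CLI flags listed in the table above (the scheme, sample count, and output directory values are illustrative):

python examples/awq/regression_tests/llama3_awq.py \
    --scheme W4A16_ASYM \
    --num-samples 128 \
    --save-dir ./llama3-W4A16_ASYM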

Sequence Diagram

sequenceDiagram
    participant User
    participant Script as AWQ Script
    participant Model
    participant Dataset
    participant AWQModifier
    participant Quantizer
    participant Generator

    User->>Script: Execute with scheme & num_samples
    Script->>Model: Load pretrained model & tokenizer
    Script->>Dataset: Load calibration dataset
    Script->>Dataset: Preprocess (chat template & tokenize)
    Script->>AWQModifier: Construct recipe with scheme & ignore rules
    Script->>Quantizer: Execute oneshot(dataset, recipe, config)
    Quantizer->>Model: Apply quantization
    Quantizer-->>Script: Return quantized model
    Script->>Model: Dispatch for execution
    Script->>Generator: Generate sample text
    Generator-->>Script: Return generated tokens
    Script-->>User: Save quantized model & report metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

awq, enhancement, qwen, llama

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 23.81%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title '[not for land] DDP regression tests' is directly related to the changeset, which adds a comprehensive DDP regression test suite with quantization scripts and a bash harness.
  • Description check (✅ Passed): The description clearly explains the purpose of the PR: a regression test framework comparing pre-DDP and post-DDP AWQ quality, including test matrix, models, schemes, benchmarks, usage instructions, and test plan verification.


@coderabbitai Bot added the enhancement, qwen, llama, and awq labels on Apr 13, 2026
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

Contributor

@coderabbitai Bot left a comment

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/awq/regression_tests/llama3_awq.py`:
- Around line 29-30: The model-loading call using
AutoModelForCausalLM.from_pretrained currently passes dtype="auto" which is
ignored; change the parameter name to torch_dtype="auto" so the model is loaded
with the correct precision (e.g., update the call in llama3_awq.py where
AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto") is used and make
the identical change in qwen25_32b_awq.py for its
AutoModelForCausalLM.from_pretrained call; keep the
AutoTokenizer.from_pretrained(..., trust_remote_code=True) calls unchanged.

In `@examples/awq/regression_tests/llama4_scout_awq.py`:
- Around line 19-128: Run ruff format on the file to fix the style errors
reported by CI: run `ruff format` (or your editor's ruff formatting) and stage
the resulting changes; focus on formatting inside the main() block and helper
functions (main, preprocess_function, data_collator, and the
AWQModifier/AWQMapping recipe) so the file passes `ruff format --check` and the
CI quality job turns green. Ensure you do not alter logic—only whitespace, line
breaks, and import/style formatting—then commit the formatted file.

In `@examples/awq/regression_tests/qwen25_32b_awq.py`:
- Around line 29-30: The model is being loaded with the wrong keyword argument:
replace the incorrect dtype="auto" used in AutoModelForCausalLM.from_pretrained
with the correct torch_dtype="auto" so the 32B model loads in the intended
precision; update the call to AutoModelForCausalLM.from_pretrained(MODEL_ID,
torch_dtype="auto") (keeping AutoTokenizer.from_pretrained(MODEL_ID,
trust_remote_code=True) unchanged).

In `@examples/awq/regression_tests/qwen3_vl_awq.py`:
- Around line 103-127: The smoke test currently pulls a remote image URL inside
the messages list (the "image": "http://images.cocodataset.org/...") which
introduces a network dependency that can fail before quantization is saved;
update the messages construction so it uses a local/test image already present
in the harness (e.g., reuse an existing dataset image buffer or path used
earlier) or make the generation step best-effort by wrapping the
vision-processing/generation sequence (process_vision_info, processor(...),
model.generate) in a try/except that logs the failure and continues without
marking the quantization as failed; target the messages variable,
process_vision_info invocation, the processor(...) call and the
model.generate(...) call when applying the change.

In `@extract_log_summary.py`:
- Around line 41-50: The current code searches the whole content for the
MODEL/CODE STATE pattern so repeats reuse the first global match; instead
restrict the search to the current scheme's block: first locate the scheme
header for this iteration (e.g., find the scheme-start boundary in content),
then run re.search(pattern, scheme_block, re.DOTALL) so match, start,
next_section and body are computed relative to that scheme_block (not the full
content); update the logic around pattern, match, start, next_section and body
to slice content to the scheme range before extracting the MODEL and CODE STATE
section.

In `@run_all_tests.sh`:
- Around line 267-269: The cleanup currently removes $save_dir unconditionally
which defeats the earlier quantization skip; update the flow in run_all_tests.sh
so that before doing any quantization/eval work you check for the presence of
the final evaluation outputs (results_*.json) and, if they already exist, skip
the entire quantize+eval block (the code guarded by the $save_dir existence
check and the later removal code around lines referencing results_*.json and the
rm -rf $save_dir cleanup); alternatively, defer deleting $save_dir until after
you confirm all eval outputs have been produced—ensure this change touches the
quantize check that uses $save_dir and the later cleanup block that deletes
$save_dir so resumable runs do not force re-quantization.
- Around line 58-60: The activate_env() function currently hardcodes a personal
path (/home/HDCharles/rhdev/bin/activate); change it to not assume that path by
either: 1) removing the function and documenting that the caller must activate
the virtualenv, or 2) making activate_env() read a venv path from an environment
variable (e.g. VENV_ACTIVATE or VENV_PATH) or a CLI argument, validate that the
file exists before sourcing, and emit a clear error and exit if not set/invalid;
update any callers to pass the variable or rely on the caller-activated
environment.
- Around line 204-215: The current switch_code_state() function checks out
historical commits in-place which removes the test harness and newly added
regression scripts (examples/awq/regression_tests/*); change the approach to
avoid in-place checkouts by either (A) using git worktree or a fresh clone:
create a separate worktree/clone for PRE_DDP_COMMIT and POST_DDP_COMMIT, run pip
install -e . in those isolated directories and invoke the regression scripts
from those worktrees, or (B) before any checkout, copy the harness and
examples/awq/regression_tests/* to a temporary directory and run the tests from
that safe copy; update switch_code_state() and the test invocation logic to
reference the worktree/temporary copy instead of the current working tree so the
scripts introduced by the PR are never removed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 85ec6662-14e8-47f9-9c74-4a521f89aaa5

📥 Commits

Reviewing files that changed from the base of the PR and between 9d328ed and c13fce8.

📒 Files selected for processing (7)
  • examples/awq/regression_tests/llama3_awq.py
  • examples/awq/regression_tests/llama4_scout_awq.py
  • examples/awq/regression_tests/mixtral_awq.py
  • examples/awq/regression_tests/qwen25_32b_awq.py
  • examples/awq/regression_tests/qwen3_vl_awq.py
  • extract_log_summary.py
  • run_all_tests.sh

Comment on lines +29 to +30
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the actual code at the specified lines
echo "=== Checking llama3_awq.py lines 25-35 ==="
cat -n examples/awq/regression_tests/llama3_awq.py | sed -n '25,35p'

echo ""
echo "=== Checking qwen25_32b_awq.py ==="
if [ -f "examples/awq/regression_tests/qwen25_32b_awq.py" ]; then
  grep -n "AutoModelForCausalLM.from_pretrained" examples/awq/regression_tests/qwen25_32b_awq.py
else
  echo "File not found"
fi

echo ""
echo "=== Checking llama4_scout_awq.py ==="
if [ -f "examples/awq/regression_tests/llama4_scout_awq.py" ]; then
  grep -n "AutoModelForCausalLM.from_pretrained" examples/awq/regression_tests/llama4_scout_awq.py
else
  echo "File not found"
fi

Repository: vllm-project/llm-compressor

Length of output: 697


🌐 Web query:

AutoModelForCausalLM.from_pretrained dtype parameter torch_dtype hugging face transformers

💡 Result:

The torch_dtype parameter in AutoModelForCausalLM.from_pretrained specifies the PyTorch data type (dtype) for loading the model weights, overriding the default torch.float32. It accepts a torch.dtype (e.g., torch.float16, torch.bfloat16) or the string "auto". When "auto" is used:
  • Checks config.json for a torch_dtype or dtype entry.
  • If absent, infers from the dtype of the first floating-point weight in the checkpoint.

Examples:

import torch
from transformers import AutoModelForCausalLM

# Load with specific dtype
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)

# Auto-detect dtype
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype="auto")

Note: torch_dtype primarily affects parameters; some buffers with explicit dtypes (e.g., torch.float32) remain unchanged. In recent versions (post-v4.XX), it is also referred to as dtype in examples, but torch_dtype remains valid. This parameter helps reduce memory usage (e.g., FP16/BF16) and is passed via kwargs to the underlying PreTrainedModel.from_pretrained method.

Change dtype parameter to torch_dtype in model-loading calls.

Line 29 uses dtype="auto" but the Transformers API requires torch_dtype="auto". Using the incorrect parameter name causes it to be ignored, loading the model at default precision (float32) instead of auto-detecting the optimal dtype. This undermines the regression test by producing invalid memory and timing measurements.

The same issue exists in examples/awq/regression_tests/qwen25_32b_awq.py on line 29.

Suggested changes
-    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
+    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

Apply the same fix to qwen25_32b_awq.py.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
-    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/awq/regression_tests/llama3_awq.py` around lines 29 - 30, The
model-loading call using AutoModelForCausalLM.from_pretrained currently passes
dtype="auto" which is ignored; change the parameter name to torch_dtype="auto"
so the model is loaded with the correct precision (e.g., update the call in
llama3_awq.py where AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
is used and make the identical change in qwen25_32b_awq.py for its
AutoModelForCausalLM.from_pretrained call; keep the
AutoTokenizer.from_pretrained(..., trust_remote_code=True) calls unchanged.

Comment on lines +19 to +128
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--scheme", default="W4A16_ASYM")
    parser.add_argument("--save-dir", default=None)
    parser.add_argument("--num-samples", type=int, default=NUM_CALIBRATION_SAMPLES)
    args = parser.parse_args()

    num_samples = args.num_samples

    model = Llama4ForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto")
    processor = Llama4Processor.from_pretrained(MODEL_ID)

    ds = load_dataset(
        DATASET_ID, name="LLM", split=f"train[:{num_samples}]"
    )

    def preprocess_function(example):
        messages = []
        for message in example["messages"]:
            messages.append(
                {
                    "role": message["role"],
                    "content": [{"type": "text", "text": message["content"]}],
                }
            )

        return processor.apply_chat_template(
            messages,
            return_tensors="pt",
            padding=False,
            truncation=True,
            max_length=MAX_SEQUENCE_LENGTH,
            tokenize=True,
            add_special_tokens=False,
            return_dict=True,
            add_generation_prompt=False,
        )

    ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)

    def data_collator(batch):
        assert len(batch) == 1
        return {
            key: (
                torch.tensor(value)
                if key != "pixel_values"
                else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
            )
            for key, value in batch[0].items()
        }

    # Llama-4-Scout has both vision_model and language_model sub-models,
    # so mappings must be scoped to language_model to avoid dual matches.
    # The main experts use a fused gate_up_proj (not Linear), so only
    # shared_expert Linear layers are AWQ targets.
    recipe = AWQModifier(
        targets="Linear",
        scheme=args.scheme,
        ignore=[
            "re:.*lm_head",
            "re:.*self_attn",
            "re:.*router",
            "re:.*vision_model.*",
            "re:.*multi_modal_projector.*",
            "Llama4TextAttention",
        ],
        mappings=[
            AWQMapping(
                "re:.*language_model.*post_attention_layernorm$",
                [
                    "re:.*shared_expert.gate_proj$",
                    "re:.*shared_expert.up_proj$",
                ],
            ),
            AWQMapping(
                "re:.*shared_expert.up_proj$",
                ["re:.*shared_expert.down_proj$"],
            ),
        ],
    )

    torch.cuda.reset_peak_memory_stats()
    start_time = time.time()

    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=num_samples,
        data_collator=data_collator,
        sequential_targets=["Llama4TextMLP"],
    )

    elapsed_time = time.time() - start_time
    peak_memory_gb = torch.cuda.max_memory_allocated() / (1024**3)
    print("Quantization Complete")
    print(f"Time: {elapsed_time / 60:.2f} minutes ({elapsed_time:.2f} seconds)")
    print(f"Peak GPU Memory: {peak_memory_gb:.2f} GB")

    save_dir = args.save_dir or (
        MODEL_ID.rstrip("/").split("/")[-1] + f"-{args.scheme}"
    )
    model.save_pretrained(save_dir, save_compressed=True)
    processor.save_pretrained(save_dir)
    print(f"Model saved to {save_dir}")


if __name__ == "__main__":
    main()

⚠️ Potential issue | 🟡 Minor

Run ruff format on this file before merge.

CI is already red on examples/awq/regression_tests/llama4_scout_awq.py due to ruff format --check, so this file needs a formatting pass to get the quality job green.
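A one-liner that should address this locally, assuming ruff is available from the dev requirements:

ruff format examples/awq/regression_tests/llama4_scout_awq.py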

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/awq/regression_tests/llama4_scout_awq.py` around lines 19 - 128, Run
ruff format on the file to fix the style errors reported by CI: run `ruff
format` (or your editor's ruff formatting) and stage the resulting changes;
focus on formatting inside the main() block and helper functions (main,
preprocess_function, data_collator, and the AWQModifier/AWQMapping recipe) so
the file passes `ruff format --check` and the CI quality job turns green. Ensure
you do not alter logic—only whitespace, line breaks, and import/style
formatting—then commit the formatted file.

Comment on lines +29 to +30
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the actual content at lines 29-30 in the file
cat -n examples/awq/regression_tests/qwen25_32b_awq.py | sed -n '25,35p'

Repository: vllm-project/llm-compressor

Length of output: 504


🏁 Script executed (no clone):

#!/bin/bash
# Search for Hugging Face Transformers documentation on from_pretrained parameter names
# Check what the correct parameter should be
curl -s "https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/model#transformers.PreTrainedModel.from_pretrained" 2>/dev/null | grep -i "torch_dtype\|dtype" | head -20 || echo "Could not fetch docs directly"

Length of output: 23243


🏁 Script executed:

#!/bin/bash
# Check for other usages of from_pretrained in the codebase to see patterns
rg "from_pretrained.*\(dtype|torch_dtype)" examples/ --type py -A 2 -B 2

Repository: vllm-project/llm-compressor

Length of output: 195


🏁 Script executed:

#!/bin/bash
# Also check the llama3 file mentioned in the scratchpad to see if it has the same issue
cat -n examples/awq/regression_tests/llama3_awq.py | sed -n '25,35p'

Repository: vllm-project/llm-compressor

Length of output: 504


🏁 Script executed:

#!/bin/bash
# Search for other from_pretrained usages to see what pattern is typically used
rg "from_pretrained\(" examples/ --type py -A 1 | head -50

Repository: vllm-project/llm-compressor

Length of output: 3643


🏁 Script executed:

#!/bin/bash
# Check if there are any successful uses of torch_dtype in the repo
rg "torch_dtype" --type py

Repository: vllm-project/llm-compressor

Length of output: 2328


Use the correct torch_dtype parameter for model loading.

Line 29 uses dtype="auto", but Transformers' from_pretrained() API expects torch_dtype="auto". The incorrect parameter name is silently ignored, causing the 32B model to load in the wrong precision and invalidating regression test results.

Suggested change
-    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
+    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
-    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/awq/regression_tests/qwen25_32b_awq.py` around lines 29 - 30, The
model is being loaded with the wrong keyword argument: replace the incorrect
dtype="auto" used in AutoModelForCausalLM.from_pretrained with the correct
torch_dtype="auto" so the 32B model loads in the intended precision; update the
call to AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
(keeping AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
unchanged).

Comment on lines +103 to +127
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "http://images.cocodataset.org/train2017/000000231895.jpg",
                },
                {"type": "text", "text": "Please describe the animal in this image\n"},
            ],
        }
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[prompt],
        images=image_inputs,
        videos=video_inputs,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=100)
    print(processor.decode(output[0], skip_special_tokens=True))

⚠️ Potential issue | 🟠 Major

Remove the network dependency from the smoke test.

Lines 109-124 pull a remote image for the post-quantization sanity check, and this runs before Lines 133-135 save the quantized artifact. In the harness, a failure here marks the whole block as QUANT_FAILED, so a flaky network can discard a successful quantization for reasons unrelated to AWQ. Reuse one of the already-loaded dataset images or make this generation step best-effort instead.
As per coding guidelines, examples/**/*.py: "Ensure examples follow best practices and are runnable."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/awq/regression_tests/qwen3_vl_awq.py` around lines 103 - 127, The
smoke test currently pulls a remote image URL inside the messages list (the
"image": "http://images.cocodataset.org/...") which introduces a network
dependency that can fail before quantization is saved; update the messages
construction so it uses a local/test image already present in the harness (e.g.,
reuse an existing dataset image buffer or path used earlier) or make the
generation step best-effort by wrapping the vision-processing/generation
sequence (process_vision_info, processor(...), model.generate) in a try/except
that logs the failure and continues without marking the quantization as failed;
target the messages variable, process_vision_info invocation, the processor(...)
call and the model.generate(...) call when applying the change.

Comment thread extract_log_summary.py
Comment on lines +41 to +50
        # Find the body for this section
        pattern = re.escape(f"MODEL: {model}") + r".*?" + re.escape(f"CODE STATE: {code_state}")
        match = re.search(pattern, content, re.DOTALL)
        if not match:
            continue
        # Get everything after the match until the next section
        start = match.end()
        next_section = re.search(r"║\s+MODEL:", content[start:])
        end = start + next_section.start() if next_section else len(content)
        body = content[start:end]

⚠️ Potential issue | 🔴 Critical

This re-search picks the wrong section body for repeated schemes.

Lines 42-49 search the full log again using only MODEL and CODE STATE. Each model/code-state pair appears once per scheme, so every later scheme will reuse the first matching block and get parsed with the wrong metrics. That makes the summary table incorrect for repeated model/scheme combinations.

Suggested fix
-    sections = section_pattern.findall(content)
+    sections = list(section_pattern.finditer(content))
@@
-    for model, scheme, code_state in sections:
-        model = model.strip()
-        scheme = scheme.strip()
-        code_state = code_state.strip()
-
-        # Find the body for this section
-        pattern = re.escape(f"MODEL: {model}") + r".*?" + re.escape(f"CODE STATE: {code_state}")
-        match = re.search(pattern, content, re.DOTALL)
-        if not match:
-            continue
-        # Get everything after the match until the next section
-        start = match.end()
-        next_section = re.search(r"║\s+MODEL:", content[start:])
-        end = start + next_section.start() if next_section else len(content)
-        body = content[start:end]
+    for idx, match in enumerate(sections):
+        model, scheme, code_state = (part.strip() for part in match.groups())
+        start = match.end()
+        end = sections[idx + 1].start() if idx + 1 < len(sections) else len(content)
+        body = content[start:end]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@extract_log_summary.py` around lines 41 - 50, The current code searches the
whole content for the MODEL/CODE STATE pattern so repeats reuse the first global
match; instead restrict the search to the current scheme's block: first locate
the scheme header for this iteration (e.g., find the scheme-start boundary in
content), then run re.search(pattern, scheme_block, re.DOTALL) so match, start,
next_section and body are computed relative to that scheme_block (not the full
content); update the logic around pattern, match, start, next_section and body
to slice content to the scheme range before extracting the MODEL and CODE STATE
section.

Comment thread run_all_tests.sh
Comment on lines +58 to +60
activate_env() {
    source /home/HDCharles/rhdev/bin/activate
}

⚠️ Potential issue | 🟠 Major

Don't hardcode a personal virtualenv path.

Line 59 assumes /home/HDCharles/rhdev/bin/activate exists. Everywhere except that workstation, source fails before any regression runs start. Let the caller activate the environment first, or take the venv path from an env var / CLI argument.

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 59-59: Not following: /home/HDCharles/rhdev/bin/activate was not specified as input (see shellcheck -x).

(SC1091)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run_all_tests.sh` around lines 58 - 60, The activate_env() function currently
hardcodes a personal path (/home/HDCharles/rhdev/bin/activate); change it to not
assume that path by either: 1) removing the function and documenting that the
caller must activate the virtualenv, or 2) making activate_env() read a venv
path from an environment variable (e.g. VENV_ACTIVATE or VENV_PATH) or a CLI
argument, validate that the file exists before sourcing, and emit a clear error
and exit if not set/invalid; update any callers to pass the variable or rely on
the caller-activated environment.

Comment thread run_all_tests.sh
Comment on lines +204 to +215
switch_code_state() {
    local state=$1

    if [ "$state" == "pre-ddp" ]; then
        echo "Switching to pre-DDP code ($PRE_DDP_COMMIT)..."
        git checkout "$PRE_DDP_COMMIT" 2>&1
        pip install -e . --quiet 2>&1
    elif [ "$state" == "post-ddp" ]; then
        echo "Switching to post-DDP code ($POST_DDP_COMMIT)..."
        git checkout "$POST_DDP_COMMIT" 2>&1
        pip install -e . --quiet 2>&1
    fi

⚠️ Potential issue | 🔴 Critical

Checking out the target commits here removes the very scripts you're about to run.

This harness runs from the current PR branch, but Lines 207-214 switch the repo to historical commits before Lines 278-283 invoke examples/awq/regression_tests/*.py. Because those regression scripts are introduced by this PR, the checkout will remove or replace them from the working tree, so the subsequent quantization command can fail with “file not found” or run the wrong script contents. Use separate worktrees/clones for the pre/post code states, or copy the harness scripts outside the repo before switching revisions.

Also applies to: 278-283
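A minimal sketch of the worktree option; the ../llmc-* paths are illustrative, while PRE_DDP_COMMIT and POST_DDP_COMMIT follow the snippet above:

# One worktree per code state, so the PR branch (and its regression scripts)
# stays checked out in the current directory.
git worktree add ../llmc-pre-ddp "$PRE_DDP_COMMIT"
git worktree add ../llmc-post-ddp "$POST_DDP_COMMIT"

switch_code_state() {
    local state=$1
    # Install llmcompressor from the matching worktree; never touch this checkout.
    if [ "$state" == "pre-ddp" ]; then
        pip install -e ../llmc-pre-ddp --quiet
    else
        pip install -e ../llmc-post-ddp --quiet
    fi
}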

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run_all_tests.sh` around lines 204 - 215, The current switch_code_state()
function checks out historical commits in-place which removes the test harness
and newly added regression scripts (examples/awq/regression_tests/*); change the
approach to avoid in-place checkouts by either (A) using git worktree or a fresh
clone: create a separate worktree/clone for PRE_DDP_COMMIT and POST_DDP_COMMIT,
run pip install -e . in those isolated directories and invoke the regression
scripts from those worktrees, or (B) before any checkout, copy the harness and
examples/awq/regression_tests/* to a temporary directory and run the tests from
that safe copy; update switch_code_state() and the test invocation logic to
reference the worktree/temporary copy instead of the current working tree so the
scripts introduced by the PR are never removed.

Comment thread run_all_tests.sh
Comment on lines +267 to +269
# ── Quantize (skip if model already exists) ────────────
if [ -d "$save_dir" ] && [ -f "$save_dir/config.json" ]; then
echo "Quantized model already exists at $save_dir, skipping quantization."

⚠️ Potential issue | 🟠 Major

The cleanup step defeats the resumable quantization skip.

Lines 268-269 only skip quantization while $save_dir still exists, but Lines 333-336 delete it unconditionally after every block. On a resumed run, the harness will therefore re-quantize every model/scheme/state even when Lines 309-317 are about to skip all evaluations because results_*.json already exist. If resumability is a goal, skip the whole block up front when all eval outputs are present, or keep the quantized model until that block is fully reusable.

Also applies to: 309-317, 332-336
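One way to realize the "skip the whole block up front" option, sketched inside the existing model/scheme/state loops with an illustrative results_exist_for_all_tasks helper that does not exist in the current script:

# Skip quantize + eval + cleanup entirely when every results_*.json is present.
if results_exist_for_all_tasks "$model" "$scheme" "$code_state"; then
    echo "All eval results present for $model/$scheme/$code_state, skipping block."
    continue
fi

# ... quantize and evaluate as before ...

# Delete the quantized model only after all eval outputs are confirmed on disk.
if results_exist_for_all_tasks "$model" "$scheme" "$code_state"; then
    rm -rf "$save_dir"
fi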

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run_all_tests.sh` around lines 267 - 269, The cleanup currently removes
$save_dir unconditionally which defeats the earlier quantization skip; update
the flow in run_all_tests.sh so that before doing any quantization/eval work you
check for the presence of the final evaluation outputs (results_*.json) and, if
they already exist, skip the entire quantize+eval block (the code guarded by the
$save_dir existence check and the later removal code around lines referencing
results_*.json and the rm -rf $save_dir cleanup); alternatively, defer deleting
$save_dir until after you confirm all eval outputs have been produced—ensure
this change touches the quantize check that uses $save_dir and the later cleanup
block that deletes $save_dir so resumable runs do not force re-quantization.

@mergify Bot added the documentation label on Apr 13, 2026
Contributor

mergify Bot commented Apr 13, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md
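
A possible sequence, assuming the optional dev extra described in CONTRIBUTING.md is named "dev":

pip install -e ".[dev]"   # linting/formatting dependencies (extra name assumed)
make style                # auto-format
make quality              # the checks that are currently failing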
